Abstract. - Complex systems research is becoming increasingly data-driven, particularly in the 
social and biological domains. Many of the systems from which sample data are collected feature 
structural heterogeneity at the mesoscopic scale (i.e. communities) and limited inter-community 
diffusion. Here we show that the interplay between these two features can yield a significant bias 
in the global characteristics inferred from the data. We present a general framework to quantify 
this bias, and derive an explicit corrective factor for a wide class of systems. Applying our analysis 
to a recent high-profile survey of conflict mortality in Iraq suggests a significant overestimate of 
deaths. 
Introduction. — Monitoring large social or biological
systems bears similar challenges to monitoring many-
particle systems in physics. The increasing availability of 
data on human behaviour from information and commu- 
nication technologies [1,2] and data from high throughput 
techniques in biology enable scientists to study these di- 
verse systems with similar methodologies. Many biologi- 
cal and social systems are not internally homogeneous, but 
instead feature time-dependent community groupings and 
limited inter-community mixing [1-4]. Individuals form 
dynamic groups in professional and private settings re- 
flected in, for example, structures of scientific collabora- 
tion and mobile phone call patterns [3]. The cell nucleus 
consists of multiple compartments with different micro- 
environments that exist in spatially localised regions in 
the heterogeneous intranuclear space [4]. This problem 
setup is similar to that in so-called metapopulation mod- 
els, which involve spatially structured populations and are 
commonly used in ecology and epidemiology [5,6]. In this 
Letter, we quantify the consequences of sampling a subset 
of objects in such a system. Starting with a general theoretical
framework, we show that the interplay between
heterogeneity and limited diffusion can yield a substantial
bias in the inferred global characteristics. We obtain
an explicit corrective factor to offset a bias that occurs 
if the structural heterogeneity of the system and the lim- 
ited internal diffusion within the system are not taken into 
account in the initial data sampling. We then consider a 
special case of this general framework, in which the correc- 
tive factor turns out to only depend on three parameters. 
Two of these parameters are associated with heterogeneity 
and one with diffusion. Finally, we consider the specific 
example of a recent conflict mortality study in Iraq, and 
show that a considerable positive bias likely arose in the 
inferred mortality numbers. 

General framework. — Consider a large system made up of $N$ particles characterised by a microscopic state variable $x_i$. The system is heterogeneous in that it consists of $m$ different subsystems or communities $S_1, \ldots, S_m$ with $N_i$ particles in $S_i$ such that $N_1 + \ldots + N_m = N$. The subsystems are interconnected in some limited way, thereby allowing for only partial diffusion or mixing of particles between them. We wish to learn about the state of the system described by the extensive macroscopic variable $X = \sum_{i=1}^{N} x_i$ but, in line with typical empirical scenarios, assume that we cannot observe the entire system. Instead, we monitor the state of a set of tagged particles in different subsystems and use this data to make statistical inferences about $X$.

Fig. 1: (Colour online) (a) The system is prepared by tagging some particles in some of the subsystems, which corresponds to a sampling process. The particles are non-interacting and indistinguishable apart from the initial subsystem given by their colour. (b) After the initial state, the matrix $f \neq I$ quantifies mixing between subsystems. It can be interpreted as a weighted and directed network adjacency matrix of the subsystems. The state of particle $x_i \in \{0, 1\}$ is indicated by colouring its circumference black or white, respectively. Only tagged particles are visible and available for analysis.

Let us assume that the particles are identical and non- 
interacting and that each can be in one of two states 
$x_i \in \{0, 1\}$. The system is initially prepared with $x_i = 0$
for all $i$ and only irreversible $0 \to 1$ changes are considered.
Microscopic state changes are subsystem specific, with the
element $q_k$ of the vector $q$ specifying the probability for a
particle in subsystem $S_k$ to change state. Hence the $x_i$ are
independent and identically distributed random variables
within a given subsystem and we let $y_k$ denote a random
variable having the distribution of any $x_i$ present in $S_k$.
The state of a particle can be identified with, for exam- 
ple, the staining of cancerous cells in a biological organism 
under medical imaging (stained vs. clear), or the disease 
status of an individual (healthy vs. diseased). The sub- 
system specific probabilities q could arise from there being 
different numbers of cancerous cells or pathogens in these 
systems. 

The mixing of particles is governed by the constant mixing
matrix $f = [f_1 \, f_2 \cdots f_m]$, where $f_i$ specifies the fraction
of time particles initially placed in $S_i$ spend in other subsystems.
The entries of $f$ can be interpreted as probabilities
of finding particles in different subsystems (see Fig. 1).
The diagonal elements $f_{ii}$ correspond to the probability of
finding a particle in its initial subsystem. Note that $f$
does not need to be symmetric. In the limit as the mobility
of the particles tends to zero, the matrix $f$ consists of
only diagonal elements $f_{ij} = \delta_{ij}$, with the effect that the
subsystems become completely isolated.

Denote by $X_i$ the contribution of all particles initially
in $S_i$ towards $X$, and by $X_{ij}$ the contribution of a single
particle $j = 1, \ldots, N_i$ initially in $S_i$ towards $X_i$. Let $D_{ik}$
denote the number of particles initially in $S_i$ which are
observed in $S_k$; then $D_{ik}$ follows a binomial distribution
with parameters $N_i$ and $f_{ik}$. We write our quantity of
interest $X$ as

$$X = \sum_{i=1}^{m} X_i = \sum_{i=1}^{m} \sum_{j=1}^{N_i} X_{ij}. \qquad (1)$$



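Consider now a situation where, to estimate $X$, we can
draw samples from only some of the subsystems. Let $S$,
consisting of $m$ subsets $S_k$, denote the set of all subsystems
and let $S' = \bigcup_{k=1}^{m'} S_k$, i.e. the first $m'$ of these sets, denote
the set of samplable subsystems. The expectation value
of $X$ in the entire system, $\langle X \rangle_S$, and in the samplable
system, $\langle X \rangle_{S'}$, is given by

$$\langle X \rangle_S = \sum_{i=1}^{m} \langle X_i \rangle = \sum_{i=1}^{m} \sum_{k=1}^{m} \langle D_{ik} \rangle \langle y_k \rangle = \sum_{i=1}^{m} N_i \sum_{k=1}^{m} f_{ik} \langle y_k \rangle,$$

$$\langle X \rangle_{S'} = \sum_{i=1}^{m'} \langle X_i \rangle = \sum_{i=1}^{m'} N_i \sum_{k=1}^{m} f_{ik} \langle y_k \rangle, \qquad (2)$$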



respectively. If the subsystems are heterogeneous and this
is not accounted for in the sampling procedure, we may
incur a significant bias. To quantify this, we define the
bias factor $R$ as the scaled ratio of $\langle X \rangle_{S'}$ to $\langle X \rangle_S$,

$$R = \frac{N}{N'} \, \frac{\langle X \rangle_{S'}}{\langle X \rangle_S}, \qquad (3)$$

where $N' < N$ is the number of particles in $S'$ and values
of $R > 1$ ($R < 1$) correspond to overestimating (underestimating)
the expectation value of $X$ in the system when
sampling is based on subsystems in $S'$ only.
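As an illustration of Eqs. 2 and 3, the following minimal sketch evaluates the bias factor for a small toy system. It assumes that $\langle y_k \rangle = q_k$ and that each row of the mixing matrix sums to one; the function names and all numerical values are hypothetical and chosen only for illustration.

```python
def expected_count(N, f, q, rows):
    """Expected number of state changes among particles that start in
    the subsystems listed in `rows` (inner sums of Eq. 2, with <y_k> = q_k)."""
    m = len(q)
    return sum(N[i] * sum(f[i][k] * q[k] for k in range(m)) for i in rows)


def bias_factor(N, f, q, m_prime):
    """Scaled ratio of Eq. 3 when only the first m_prime subsystems
    are samplable."""
    m = len(N)
    total = expected_count(N, f, q, range(m))
    sampled = expected_count(N, f, q, range(m_prime))
    return (sum(N) / sum(N[:m_prime])) * sampled / total


# Hypothetical toy system: three subsystems, only the first is samplable.
N = [1000, 3000, 6000]            # particles per subsystem
q = [0.05, 0.01, 0.01]            # state-change probability per subsystem
f = [[0.90, 0.05, 0.05],          # row i: where particles from S_i are found
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]

R = bias_factor(N, f, q, m_prime=1)
print(f"R = {R:.3f}")
```

In this toy setting the samplable subsystem is both more exposed and weakly mixed, so $R$ comes out well above one; making all rows of `f` identical, or all entries of `q` equal, brings it back to one.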

Specialised framework. — A special case of the framework arises when the microscopic state variables $x_i$ and $y_k$ correspond to independent Bernoulli trials related to some event $\omega$. We assume that the event $\omega$ occurs independently of the mixing. Now $q_i$ ($1 - q_i$) is the probability of observing $\omega = 1$ ($\omega = 0$) in subsystem $S_i$ long enough after the initial state so that the system has reached an equilibrium. Regardless of the number of subsystems present, the system can always be divided into a samplable subsystem and a non-samplable subsystem. Let $S_I = S'$ and let the remaining subsystems form the non-samplable subsystem $S_O = \bigcup_{k=m'+1}^{m} S_k$. As a mnemonic, the subscript $I$ refers to in-sample and $O$ to out-of-sample. Note that whereas before $S' \subset S$, here $S_I \cap S_O = \emptyset$, so that although $S_I = S'$, $S_O \neq S$. We now have $N_I = N' = \sum_{k=1}^{m'} N_k$ and $N_O = \sum_{k=m'+1}^{m} N_k$, corresponding to the number of particles in $S_I$ and $S_O$, respectively, and $N_I + N_O = N$. We define the 'renormalised' probabilities $q_I = N_I^{-1} \sum_{k=1}^{m'} N_k q_k$ for a particle to be subjected to $\omega$ while present in $S_I$ and its complement $1 - q_I$ for the particle to not be subjected to $\omega$ while present in $S_I$. Similarly, we define for $S_O$ the probability $q_O = N_O^{-1} \sum_{k=m'+1}^{m} N_k q_k$ (and its complement $1 - q_O$) for a particle to (not) be subjected to $\omega$ while present within $S_O$. Finally, we define the mobility factors such that $f_I$ ($f_O$) is the probability for a particle initially placed in $S_I$ ($S_O$) to be present within $S_I$ ($S_O$), and $1 - f_I$ ($1 - f_O$) is the probability for a particle initially placed in $S_I$ ($S_O$) to not be present within $S_I$ ($S_O$), i.e., to be present within $S_O$ ($S_I$). These are written as



$$f_I = \sum_{i=1}^{m'} \sum_{k=1}^{m'} f_{ik}, \qquad f_O = \sum_{i=m'+1}^{m} \sum_{k=m'+1}^{m} f_{ik}. \qquad (4)$$



We now define $\pi_{\alpha\beta}$ with $\alpha, \beta \in \{O, I\}$ as the probability that a particle picked uniformly at random was placed initially in $S_\alpha$ with $x_i = 0$ and changes state to $x_i = 1$ in $S_\beta$. This leads to $\pi_{OO} = \frac{N_O}{N_I + N_O} f_O q_O$, $\pi_{OI} = \frac{N_O}{N_I + N_O} (1 - f_O) q_I$, $\pi_{IO} = \frac{N_I}{N_I + N_O} (1 - f_I) q_O$, and $\pi_{II} = \frac{N_I}{N_I + N_O} f_I q_I$. The sum $\pi_{II} + \pi_{IO} + \pi_{OI} + \pi_{OO}$ is the probability that a randomly chosen particle is subjected to $\omega$ and hence changes its microscopic state. The expected number of particles with $x_i = 1$ in a population of size $N$ is hence $N_O f_O q_O + N_O (1 - f_O) q_I + N_I (1 - f_I) q_O + N_I f_I q_I = (q_I - q_O)(f_I N_I - f_O N_O) + q_I N_O + q_O N_I$, whereas the probability that a randomly chosen particle in $S_I$ changes state is $q_I f_I + q_O (1 - f_I)$. Hence the expected number of realisations for a population of size $N$, based on the rate for $S_I$ only, would be $(N_I + N_O)[q_I f_I + q_O (1 - f_I)]$. We obtain

$$R = \frac{(N_I + N_O)[q_I f_I + q_O (1 - f_I)]}{(q_I - q_O)(f_I N_I - f_O N_O) + q_I N_O + q_O N_I}. \qquad (5)$$
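The bookkeeping behind Eq. 5 can be checked numerically. The sketch below, with hypothetical function names and parameter values, builds the four probabilities $\pi_{\alpha\beta}$, sums them to obtain the expected number of state changes, and compares the result with the closed-form denominator of Eq. 5.

```python
def pi_terms(N_I, N_O, f_I, f_O, q_I, q_O):
    """Probabilities that a random particle starts in S_alpha and
    changes state in S_beta, keyed by the two-letter index alpha+beta."""
    N = N_I + N_O
    return {
        "OO": N_O / N * f_O * q_O,
        "OI": N_O / N * (1 - f_O) * q_I,
        "IO": N_I / N * (1 - f_I) * q_O,
        "II": N_I / N * f_I * q_I,
    }


def bias_factor_eq5(N_I, N_O, f_I, f_O, q_I, q_O):
    """Ratio of the count extrapolated from S_I to the expected true count."""
    N = N_I + N_O
    expected_true = N * sum(pi_terms(N_I, N_O, f_I, f_O, q_I, q_O).values())
    expected_from_sample = N * (q_I * f_I + q_O * (1 - f_I))
    return expected_from_sample / expected_true


# Hypothetical parameter values, for illustration only.
N_I, N_O = 2_000, 8_000
f_I, f_O = 0.9, 0.8
q_I, q_O = 0.04, 0.01

summed = (N_I + N_O) * sum(pi_terms(N_I, N_O, f_I, f_O, q_I, q_O).values())
closed = (q_I - q_O) * (f_I * N_I - f_O * N_O) + q_I * N_O + q_O * N_I
print(summed, closed, bias_factor_eq5(N_I, N_O, f_I, f_O, q_I, q_O))
```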



Assuming that $N_I \neq 0$ and $q_O \neq 0$, and setting $q = q_I/q_O$
and $n = N_O/N_I$, we obtain

$$R = R(f_I, f_O, q, n) = \frac{(1 + n)(1 + q f_I - f_I)}{(q - 1)(f_I - f_O n) + q n + 1}. \qquad (6)$$

Hence the bias factor $R$ depends only on $f_I$, $f_O$, and the
ratios $q = q_I/q_O$ and $n = N_O/N_I$. Finally, in the case of
symmetric mobility with $f_I = f_O = f$, the above expression
simplifies to

$$R = R(f, q, n) = \frac{(1 + n)(1 + q f - f)}{f (q - 1)(1 - n) + q n + 1}. \qquad (7)$$

The no-bias limit of $R = 1$ requires either (1) $n = 0$
(i.e. $N_O = 0$), implying that no particle is placed initially
in $S_O$, or (2) $q = 1$ (i.e. $q_I = q_O$), implying equal rates
of changing state in $S_I$ and $S_O$, or (3) $f = 1/2$, which
suggests that particles based in $S_I$ spend on average half
of their time in $S_O$ and vice versa. Setting $R(f, q, n) = r$
for general $r$ and solving for $q$ in terms of $n$ and $f$ yields

$$q(f, n, r) = \frac{f (1 + n + n r - r) + r - n - 1}{f (1 + n + n r - r) - n r}. \qquad (8)$$
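The closed forms above are easy to exercise numerically. The following sketch codes Eq. 6 (which reduces to Eq. 7 for $f_I = f_O$) and the inverse relation of Eq. 8, then checks the three no-bias limits and a round trip. The function names and test values are hypothetical.

```python
def bias_factor(f_I, f_O, q, n):
    """Bias factor R of Eq. 6; pass f_I == f_O to obtain Eq. 7."""
    return (1 + n) * (1 + q * f_I - f_I) / ((q - 1) * (f_I - f_O * n) + q * n + 1)


def q_for_bias(f, n, r):
    """Relative rate q that produces bias r under symmetric mobility f (Eq. 8)."""
    common = f * (1 + n + n * r - r)
    return (common + r - n - 1) / (common - n * r)


# The three no-bias limits listed in the text.
print(bias_factor(0.8, 0.8, q=3.0, n=0.0))   # n = 0
print(bias_factor(0.8, 0.8, q=1.0, n=4.0))   # q = 1
print(bias_factor(0.5, 0.5, q=3.0, n=4.0))   # f = 1/2

# Round trip: compute R for a chosen q, then recover q from R.
f, q, n = 0.9, 4.0, 5.0
r = bias_factor(f, f, q, n)
print(r, q_for_bias(f, n, r))
```

Note that Eq. 8 is undefined when its denominator vanishes, so a practical implementation would need to guard against that case.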



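Although $q$ is unobservable, we can estimate $\hat{q} = N^{-1} \sum_{i=1}^{m} \sum_{j=1}^{N_i} X_{ij}$ and $\hat{q}' = (N')^{-1} \sum_{i=1}^{m'} \sum_{j=1}^{N_i} X_{ij}$, leading to
the asymptotically unbiased estimator $\hat{R} = \hat{q}'/\hat{q}$ for the
bias factor $R$. If $R = 1$ then we would expect that $\hat{R} \approx 1$.
The variation in $\hat{R}$ can be assessed via a normal approximation [9].
Basing $\hat{q}'$ on $S_I$ and assuming that $\langle X \rangle_S$ is
not too small, the approximate variance is

$$\mathrm{Var}(\hat{R}) \approx \frac{(1 + n)^2}{q_O N_I \left( f (q - 1)(1 - n) + q n + 1 \right)^2} \times \left( f q (1 - q_I) + (1 - f)(1 - q_O) + f (1 - f)(q q_I + q_O - 2 q_I) \right). \qquad (9)$$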
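One way to arrive at a variance expression of this kind is sketched below. It rests on assumptions that are ours rather than the source's: symmetric mobility $f_I = f_O = f$, the overall rate treated as fixed at its expected value, and only the binomial fluctuation of the in-sample rate over $N_I$ particles contributing. The function names and parameter values are hypothetical.

```python
import math


def var_R_hat(f, q_I, q_O, N_I, n):
    """Approximate variance of the estimated bias factor under the
    assumptions stated in the lead-in (closed form)."""
    q = q_I / q_O
    D = f * (q - 1) * (1 - n) + q * n + 1
    bracket = (f * q * (1 - q_I)
               + (1 - f) * (1 - q_O)
               + f * (1 - f) * (q * q_I + q_O - 2 * q_I))
    return (1 + n) ** 2 / (q_O * N_I * D ** 2) * bracket


def var_R_hat_direct(f, q_I, q_O, N_I, n):
    """Same quantity built step by step: binomial variance of the
    in-sample rate divided by the squared overall rate."""
    q = q_I / q_O
    p_in = q_I * f + q_O * (1 - f)                                # in-sample rate
    p_all = q_O * (f * (q - 1) * (1 - n) + q * n + 1) / (1 + n)   # overall rate
    return p_in * (1 - p_in) / N_I / p_all ** 2


# Hypothetical values, for illustration only.
args = dict(f=0.9, q_I=0.05, q_O=0.01, N_I=2_000, n=10)
print(math.sqrt(var_R_hat(**args)), math.sqrt(var_R_hat_direct(**args)))
```

The two functions agree, which confirms that the closed form is the binomial approximation written in terms of $f$, $q$ and $n$.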

Application to conflict mortality. — We will now
exemplify the above framework by applying it to study
conflict mortality. To estimate the number of deaths in a
conflict, one would ideally like to have access to a complete 
national list of households from which a sample could be 
drawn at random. Even when this scenario is feasible, the 
selected households are widely scattered, which is costly 
not only in terms of time and money, but also exposes 
the researchers to high levels of risk. To overcome these 
concerns, recent studies economise resources by using a 
cluster sampling methodology. This hierarchical sampling 
process involves making choices on how to choose large 
geographic areas and how to proceed from them to indi- 
vidual households. 

We can equate particles in the framework with individuals such that the system size $N$ corresponds to the population of the country and the state of each particle $x_i \in \{0, 1\}$ corresponds to the individual being alive or dead (where the death has resulted from conflict related violence), respectively. The different subsystems correspond to heterogeneous areas that are characterised by varying levels of conflict related violence such that the probability for an individual to be killed when he or she is in $S_k$ is given by $q_k$ regardless of where his or her residence is located. Note that these areas, or zones, may be fragmented and interspersed. Now $\langle X_k \rangle$ corresponds to the expected number of deaths in $S_k$ for a given $q_k$, and $\langle X \rangle$ corresponds to the expected number of deaths in the country. Daily human movement between different areas is quantified by the mixing matrix. The initial subsystem of a particle can be identified with the residential zone of the individual. The 'renormalised' systems $S_I$ and $S_O$ correspond to sets of subsystems that may or may not be sampled, respectively, given the sampling method. To include an individual in the study, his or her home needs to be located in the samplable subsystem $S_I$.

Let us consider a situation in which data has been col- 
lected using a sampling procedure and we are concerned 
that this sampling procedure may not be sufficiently sen- 
sitive to the structural heterogeneity of the system and 
the limited internal diffusion within the system, so that a 
systematic bias may arise. We can then use the proposed 
framework, after the initial data collection, to offset the
bias resulting from not having taken these factors fully
into account.

Fig. 2: (Colour online) Structural heterogeneity in a social system under conflict. The map shows the average homicide rate according to censual sectors in Bogota, Colombia, in the period 1997-1999. Source: Instituto Nacional de Medicina Legal. Figure adapted from [12].

Structural heterogeneity between subsystems in the con- 
text of conflict mortality is exemplified in Fig. 2, which 
shows how urban violence varies from neighbourhood to 
neighbourhood, in this case in Bogota, Colombia. Sim- 
ilar patterns hold for cities worldwide [13]. While these 
data are based mostly on criminal violence as opposed 
to conflict violence, it is plausible that, similarly, a spa- 
tially inhomogeneous pattern holds for conflict violence. 
Each cell in the map can be associated with one of the
$k = 1, \ldots, m$ subsystems and the colouring reflects a realised
value of $X_k$. In a completely homogeneous system
with $q_1 = \cdots = q_m$ we have $\langle X_1 \rangle = \cdots = \langle X_m \rangle$ and would
expect to observe less fluctuation in the values of $X_k$. We
conjecture that structural heterogeneity is likely to hold 
in conflict areas. 

Limited internal diffusion between subsystems in the 
context of conflict mortality is exemplified in Fig. 3, which 
shows the location of residence of the victims (horizontal 
axis) and the location of attacks (vertical axis) in a conflict
in Thailand. This matrix can be interpreted to reflect the
underlying diffusion matrix f and it is useful to consider 
two limiting cases. First, if the matrix were completely 
scattered, there would be no correlation between the lo- 
cation of residence and the location of violence. In this 
case the choice of sampling locations and the locations of 
violence are uncorrelated, and one might wish to choose 
sampling locations that are easily accessible. These sam- 
pling locations might be inherently more or less violent 
than the system at large but, due to extensive mobility 
of individuals, the choice of sampling locations would not 
induce a systematic bias. Second, if the matrix were perfectly diagonal, there would be a one-to-one correlation between the location of residence and the location of violence. If the sampling locations were, say, more violent than the system at large, due to lack of mobility between subsystems, the overall estimate would be biased upward. In both scenarios one would need to take population densities into account. We conjecture that diffusion between subsystems is very limited under conflict.

Fig. 3: (Colour online) Limited internal diffusion in a social system under conflict. The relationship between the residence of casualties (killings and injuries) and the place where they were attacked. The axes correspond to 59 distinct spatial locations listed in identical order, such that the horizontal axis represents the residence of the casualties while the vertical axis represents the place where the incident occurred. The data are from a conflict in Thailand and they are based on a hospital monitoring system. The bubble plots reflect the number of casualties in each area. Figure adapted from [10].

We now focus on the final stages of the sampling procedure that was used to estimate conflict mortality in Iraq [7], and refer to it as the Cross Street Sampling Algorithm (CSSA): (1) select a "constituent administrative unit" with probability proportional to its estimated population size, (2) select a main street from "a list of all main streets", (3) select randomly a residential street from "a list of residential streets crossing the main streets", (4) enumerate the households on the street, (5) select one household at random to initiate the interviewing, proceeding to 39 further adjacent households. Fig. 4 demonstrates that violent
events tend to be focused around cross-streets. Because 
cross-streets are chosen for sampling, the location of vio- 
lence and the location of sampled sites are correlated by 
means of accessibility. This correlation results in a biased 
estimate of deaths and is further amplified due to minimal 
mixing of populations between the zones. 
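The five CSSA steps can be sketched in code to make explicit which households are reachable. Everything below is a hypothetical illustration rather than the survey's actual implementation: the data layout, the function name, the uniform choice of main street in step (2), and the reading of "adjacent" as the next households in street order, truncated at the end of the street, are all assumptions.

```python
import random


def cross_street_sample(units, rng, further=39):
    """Draw one cluster of households following steps (1)-(5)."""
    # (1) administrative unit, probability proportional to population
    weights = [u["population"] for u in units]
    unit = rng.choices(units, weights=weights, k=1)[0]
    # (2) a main street from the unit's list of main streets (assumed uniform)
    main_street = rng.choice(unit["main_streets"])
    # (3) a residential street crossing that main street, chosen at random
    street = rng.choice(main_street["cross_streets"])
    # (4) households are enumerated in street order;
    # (5) random starting household plus `further` adjacent ones
    start = rng.randrange(len(street))
    return street[start:start + 1 + further]


# Hypothetical toy unit: two cross streets plus households that lie on
# no cross street and therefore can never be drawn.
units = [{
    "population": 5000,
    "main_streets": [
        {"cross_streets": [[f"A{i}" for i in range(60)],
                           [f"B{i}" for i in range(45)]]},
    ],
    "off_cross_street": [f"Z{i}" for i in range(500)],
}]

rng = random.Random(0)
cluster = cross_street_sample(units, rng)
print(len(cluster), cluster[:3])
```

In this toy layout the households listed under `off_cross_street` play the role of the non-samplable zone: no sequence of random draws can ever select them.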

Fig. 4: (Colour online) A satellite image of Baghdad showing the position of attacks that resulted in more than 10 dead. "[The attacks] are located as accurately as possible from reports since 2003. Where an exact location is not possible, in areas such as Sadr City, the marker has been placed within the district." The locations of the attacks coincide with the structure of the underlying road network, with most attacks taking place either on major roads or on roads off major roads. Image and quotation adapted from BBC News [11].

To apply the above framework we need values for the model parameters. The population parameter $n = N_O/N_I$ gives the ratio of the population resident in $S_O$ to that resident in $S_I$. Street layouts in Iraq are often irregular, hence CSSA will miss any neighbourhood not in the immediate proximity of a cross-street. Analysis of Iraqi maps suggests $n = 10$ is plausible [8]. The violence parameter $q = q_I/q_O$ gives the relative probability of death for anyone present in $S_I$, regardless of their zone of residence, to that of $S_O$. For conflicts like the one in Iraq, violent events tend to be focused around cross-streets since they are a natural habitat for patrols, convoys, police stations, parked cars, roadblocks, cafes and street-markets. Major highways would not offer such a wide range of potential targets, nor would secluded neighbourhoods and, therefore, the streets that define the samplable region $S_I$ are prime targets for improvised explosive devices, car bombs, sniper attacks, abductions and drive-by shootings [8]. Given the extent and frequency of attacks, $q = 5$ is plausible [8]. The diffusion parameter $f = f_I = f_O$ gives the fraction of time spent by residents of $S_I$ ($S_O$) in $S_I$ ($S_O$). Given the nature of the violence, travel is limited; women, children and the elderly tend to stay close to home. Consequently, mixing of populations between the zones is minimal. Using the time people spend in their homes as a lower bound on the time they spend in their zones, assuming that there are two working-age males per average household of seven [7], with each spending 6 h per 24 h day outside their own zone, yields $f = f_I = f_O = 5/7 + 2/7 \cdot 18/24 = 13/14$ [8]. These values yield $R = 3.0$, suggesting that the Iraq estimate [7] provides a substantial overestimate of deaths.
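The arithmetic behind the quoted parameter values can be reproduced exactly with rational numbers. The sketch below assumes the symmetric-mobility form of the bias factor (Eq. 7) and uses hypothetical variable names; the inputs $n = 10$, $q = 5$ and the household composition are those quoted in the text.

```python
from fractions import Fraction


def bias_factor(f, q, n):
    """Bias factor under symmetric mobility f_I = f_O = f (Eq. 7)."""
    return (1 + n) * (1 + q * f - f) / (f * (q - 1) * (1 - n) + q * n + 1)


# Five of seven household members are assumed to stay in their zone all day;
# the two working-age males spend 18 of 24 hours in their own zone.
f = Fraction(5, 7) + Fraction(2, 7) * Fraction(18, 24)

R = bias_factor(f, q=5, n=10)
print(f, R, round(float(R), 2))
```

With these inputs the bias factor is slightly below three, which rounds to the value of 3.0 quoted above.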

It is clear from Eq. 6 that in order to arrive at an accurate estimate of $R$, one needs to have reasonably accurate estimates of the parameters $f_I$, $f_O$, $q$ and $n$. To gauge the sensitivity of our result, we perform a simple sensitivity analysis by evaluating $R$ for different values of parameters in Fig. 5. This shows the effect of relaxing the constraint $f = f_I = f_O$ and it is clear that in the limit of no mobility ($f_I = f_O = 1$) the bias is greatest. Conceptually speaking, the bias emerges from having simultaneously partial localisation of violence (structural heterogeneity) and partial localisation of people (limited internal diffusion). Both of these conditions are needed for the bias to emerge, since if $q = 1$ (structural homogeneity) we have $R = 1$ regardless of $n$ and $f$, and if $f = 1/2$ (perfect diffusion), we have $R = 1$ regardless of $q$ and $n$. In general, the shapes of the $R$-surfaces in Fig. 5 are smooth and the surfaces are monotonically increasing functions of $n$ and $q$. In this sense the framework is robust to the parameter values.
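A sensitivity scan of this kind can be sketched as follows. The mobility values {0.75, 0.85, 0.95} are those used for the panels of Fig. 5, whereas the grids for $n$ and $q$ and all names in the code are assumptions made for illustration.

```python
def bias_factor(f_I, f_O, q, n):
    """Bias factor R of Eq. 6."""
    return (1 + n) * (1 + q * f_I - f_I) / ((q - 1) * (f_I - f_O * n) + q * n + 1)


mobilities = [0.75, 0.85, 0.95]   # panel values of f_I and f_O
ns = [1, 5, 10, 20]               # assumed grid for n = N_O / N_I
qs = [2, 5, 10]                   # assumed grid for q = q_I / q_O

# One R-surface per (f_I, f_O) panel, indexed as surface[q_index][n_index].
surfaces = {
    (f_I, f_O): [[bias_factor(f_I, f_O, q, n) for n in ns] for q in qs]
    for f_I in mobilities
    for f_O in mobilities
}

for (f_I, f_O), surface in surfaces.items():
    lo = min(min(row) for row in surface)
    hi = max(max(row) for row in surface)
    print(f"f_I={f_I:.2f} f_O={f_O:.2f}  R in [{lo:.2f}, {hi:.2f}]")
```

On these grids every surface rises with both $n$ and $q$, in line with the monotonic behaviour described in the text.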

A more precise quantification of the bias can be achieved 
within the framework only if the actual micro-level data 
of the conflict study [7] are released, which would enable 
a more precise determination of the model parameters. 
Importantly, this does not entail further data collection, 
which is especially valuable when the survey needs to be 
carried out under extremely difficult conditions. Even re- 
lease of information concerning how many streets are in- 
cluded in "a list of all main streets" in step (2) of CSSA 
would improve the estimate. This is because the defini- 
tion of a "main street" sets the granularity level of the 
system. A shorter list implies that the areas enclosed by 
the streets are bigger, which necessarily decreases mixing 
between areas, and results in an even larger bias. 

Conclusion. — We have presented a framework that 
can be used to gauge sampling bias in systems featuring 
heterogeneity and limited internal diffusion. We have ap- 
plied the framework to a recent conflict mortality study [7] 
to illustrate how one can, after the initial data collection, 
adjust for the bias resulting from sampling such a system. 
We have demonstrated that the conflict mortality study is 
likely to present a high upward bias and, using our frame- 
work, have gauged the extent of this bias using simple 
plausibility arguments for our framework parameters. We 
believe that our approach and assumptions are reasonable 
given the limited information to hand. It appears that 
the results reported in [7] are a substantial overestimate 
of deaths. This finding is compatible with recent inde- 
pendent research. The figures reported in [7] are 3 times 
higher than the Iraq Living Conditions Survey of the UN 
Development Program estimate for the same time period 
(the first 13 months of the war) [14] , 4 times the Iraq Fam- 
ily Health Survey estimate for the same time period [16], 
and 12 times the Iraq Body Count estimate (based on 
media monitoring) for the same time period [15]. Rather 
than opening a debate on the precise extent of the bias, 
we hope that the present work will open up the way to 
further studies aimed at specifying more precisely the in- 
formation needed to improve these estimates. Given that 
many social and biological systems feature structural het- 
erogeneity and limited internal diffusion, our framework 
should prove invaluable for correcting for such biases. 
Fig. 5: (Colour online) Sensitivity analysis of bias factor $R$ defined by Eq. 6. Each panel shows $R = R(f_I, f_O, q, n)$ with the values of $f_I$ and $f_O$ fixed for each panel. Here $f_I$ ($f_O$) varies by columns (rows) over the values {0.75, 0.85, 0.95} increasing from left to right (bottom to top). The height of the surface above the $(n, q)$ plane, in addition to being given by the z-coordinate in the plots, is also colour coded to guide the eye and to emphasise the smoothness of the surfaces.